When you have the foundation in place, as described in the preceding section, you can move on to building a tailored software-driven high-availability solution. Which HA option (or options) you should use depends on your HA requirements. The following high-availability options are used both individually and, very often, together to achieve different levels of HA.
All these options are readily available “out of the box” from Microsoft, in the Windows Server family of products and in Microsoft SQL Server 2008.
It is important to understand that some of these options can be used together, but not all of them can be combined. For example, you might use Microsoft Cluster Services (MSCS) along with Microsoft SQL Server 2008’s SQL Clustering to implement a clustered database configuration, whereas you wouldn’t necessarily need MSCS for database mirroring.
Microsoft Cluster Services (MSCS)
MSCS could actually be
considered a part of the basic HA foundation components described
earlier, except that it’s possible to build a high-availability system
without it (for example, a system that uses numerous redundant hardware
components and disk mirroring or RAID for its disk subsystem). Microsoft
has made MSCS the cornerstone of its clustering capabilities, and MSCS
is utilized by applications that are cluster enabled. A prime example of
a cluster-enabled technology is Microsoft SQL Server 2008.
MSCS is the advanced
Windows operating system configuration that defines and manages between 2
and 16 servers as “nodes” in a cluster. These nodes are aware of each
other and can be set up to take over cluster-aware applications from any
node that fails (for example, a failed server). This cluster
configuration also shares and controls one or more disk subsystems as
part of its high-availability capability. Figure 1 illustrates a basic two-node MSCS configuration.
MSCS is available only with the Microsoft Windows Server Enterprise and Datacenter editions. Don’t be alarmed, though: if you are considering a high-availability system in the first place, there is a good chance that your applications are already running on these enterprise-level OS versions.
MSCS can be set up in active/passive or active/active mode. Essentially, in active/passive mode, one server sits idle (that is, it is passive) while the other does the work (that is, it is active). If the active server fails, the passive one takes over the shared disk and the cluster-aware applications almost instantaneously.
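The active/passive takeover logic can be sketched as a toy model; this is an illustration only, with hypothetical node names, and is nothing like the actual MSCS implementation:

```python
# Toy model of MSCS-style active/passive failover (illustrative only).
# Node names and the heartbeat mechanism are simplified stand-ins.

class ClusterNode:
    def __init__(self, name):
        self.name = name
        self.alive = True

class ActivePassiveCluster:
    def __init__(self, active, passive):
        self.active = active
        self.passive = passive

    def heartbeat_check(self):
        """If the active node has failed, the passive node takes over
        the shared disk and the cluster-aware applications."""
        if not self.active.alive:
            self.active, self.passive = self.passive, self.active
        return self.active.name

cluster = ActivePassiveCluster(ClusterNode("NODE1"), ClusterNode("NODE2"))
print(cluster.heartbeat_check())   # NODE1 is healthy and stays active
cluster.active.alive = False       # simulate a failure on the active node
print(cluster.heartbeat_check())   # NODE2 has taken over
```

The key property the sketch captures is that, from the outside, the cluster always answers; only which node answers changes.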
SQL Clustering
If you want a SQL Server instance
to be clustered for high availability, you are essentially asking that
this SQL Server instance (and the database) be completely resilient to a
server failure and completely available to the application without the
end user ever even noticing that there was a failure (or at least with
minimal interruption). Microsoft provides this capability through the
SQL Clustering option. SQL Clustering is built on top of MSCS for its
underlying detection of a failed server and for its availability of the
databases on the shared disk (which is controlled by MSCS). SQL Server
is said to be a “cluster-aware/enabled” technology.
You cluster a SQL Server instance by creating a virtual SQL Server instance that is known to the application (the constant in the equation), backed by two physical SQL Server instances that share one set of databases. In an active/passive configuration, only one SQL Server instance is active at a time and simply goes about its work. If the active server fails (and, with it, the physical SQL Server instance), the passive server (and the physical SQL Server instance on that server) takes over almost instantaneously. This is possible because MSCS also controls the shared disk where the databases reside. The end user and application never really know which physical SQL Server instance they are on or whether one failed. Figure 2 illustrates a typical SQL Clustering configuration built on top of MSCS.
Setup and management of this type of configuration are much easier than you might think, and SQL Clustering has become the method of choice for many high-availability solutions.
Extending the clustering
model to include Network Load Balancing (NLB) pushes this particular
solution even further into higher availability—from client traffic high
availability to back-end SQL Server high availability. Figure 3
shows a four-host NLB cluster architecture acting as a virtual server
to handle the network traffic coupled with a two-node SQL cluster on the
back end. This setup is resilient from top to bottom.
The four NLB hosts
work together, distributing the work efficiently. NLB automatically
detects the failure of a server and repartitions client traffic among
the remaining servers.
SQL Clustering in SQL Server 2008 also lets you extend this fault-tolerant solution to embrace more SQL Server instances and all of SQL Server’s related services. This is a big deal because services such as Analysis Services previously had to be handled with separate techniques to achieve near high availability. Not anymore; each SQL Server service is now cluster aware.
Data Replication
The next technology option
that can be utilized to achieve high availability is data replication.
Originally, data replication was created to offload processing from a
very busy server (such as an OLTP application that must also support a
big reporting workload) or to geographically distribute data for
different, very distinct user bases (such as worldwide product ordering
applications). As data replication (transactional replication) became
more stable and reliable, it started to be used to create “warm” (almost
“hot”) standby SQL Servers that could also be used to fulfill basic
reporting needs. If the primary server ever failed, the reporting users
would still be able to work (hence a higher degree of availability
achieved for them), and the replicated reporting database could be used
as a substitute for the primary server, if needed (hence a warm-standby
SQL Server). When doing transactional replication in the “instantaneous replication” mode, all data changes are replicated to the subscriber servers extremely quickly. With SQL Server 2000, updating subscribers allowed for even greater distribution of the workload and, overall, increased the availability of the primary data and distributed the update load across the replication topology. There are, however, plenty of issues and complications involved in the updating subscribers approach (for example, conflict handlers and queues).
With SQL Server 2005, Microsoft introduced peer-to-peer replication, which is not a publisher/subscriber model but a publisher-to-publisher model (hence peer-to-peer). It is a lot easier to configure and manage than other replication topologies, but it still has its nuances to deal with. The peer-to-peer model allows excellent availability for the data and great distribution of workload along geographic (or other) lines. This may fit some companies’ availability requirements and fulfill their distributed reporting requirements as well.
The top of Figure 4
shows a typical SQL data replication configuration of a central
publisher/subscriber using continuous transactional replication. This
can serve as a basis for high availability and also fulfills a reporting
server requirement at the same time. The bottom of Figure 4 shows a typical peer-to-peer continuous transactional replication model that is also viable.
The downside of this replication approach comes into play if the subscriber (or the other peer) ever needs to become the primary server (that is, take over the work from the original server). This takes a bit of administration that is not transparent to the end user: connection strings have to be changed, ODBC data sources need to be updated, and so on. But this process may take minutes, as opposed to hours of database recovery time, and it may well be tolerable to end users. Peer-to-peer configurations handle recovery a bit better in that much of the workload is already distributed across the nodes. So, at most, only part of the user base is affected if one node goes down. Those users can easily be redirected to the other node (peer) with the same type of connection changes described earlier.
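The redirection step can be sketched as a small helper that rewrites the Data Source in a connection string; the server names and the helper itself are hypothetical, for illustration only:

```python
# Illustrative sketch of the manual redirection step after a failover:
# rewriting the Data Source in an ODBC/ADO-style connection string.

def redirect(connection_string, old_server, new_server):
    """Point a connection string at the surviving peer."""
    parts = []
    for part in connection_string.split(";"):
        key, _, value = part.partition("=")
        if key.strip().lower() == "data source" and value.strip() == old_server:
            part = f"{key}={new_server}"
        parts.append(part)
    return ";".join(parts)

conn = "Data Source=PEER1;Initial Catalog=Orders;Integrated Security=SSPI"
print(redirect(conn, "PEER1", "PEER2"))
# Data Source=PEER2;Initial Catalog=Orders;Integrated Security=SSPI
```

In practice this change may live in application config files or ODBC data source definitions rather than in code, which is exactly why the step is not transparent to end users.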
With
either the publisher/subscriber or peer-to-peer replication approach,
there is a risk of not having all the transactions from the publishing
server. However, often, a company is willing to live with this small
risk in favor of availability. Remember that a replicated database is an
approximate image of the primary database (up to the point of the last
update that was successfully distributed), which makes it very
attractive as a warm standby. For publishing databases that are
primarily read-only, using a warm standby is a great way to distribute
the load and mitigate the risk of any one server failing.
Log Shipping
Another, more direct,
method of creating a completely redundant database image is to utilize
log shipping. Microsoft “certifies” log shipping as a method of creating
an “almost hot” spare. Some folks even use log shipping as an
alternative to data replication (it has been referred to as “the poor
man’s data replication”). There’s just one problem: Microsoft has
formally announced that log shipping (as we know and love it) will be
deprecated in the near future. The reasons are many, but the primary one
is that it is being replaced by database mirroring (referred to as real-time log shipping, when it was first being conceived). If you still want to use log shipping, it is perfectly viable—for now.
Log shipping does three primary things:
- Makes an exact image copy of a database on one server from a database dump
- Creates a copy of that database on one or more other servers from that dump
- Continuously applies transaction log dumps from the original database to the copy
In other words,
log shipping effectively replicates the data of one server to one or
more other servers via transaction log dumps. Figure 5 shows a source/destination SQL Server pair that has been configured for log shipping.
Log shipping is a great
solution when you have to create one or more failover servers. It turns
out that, to some degree, log shipping fits the requirement of creating a
read-only subscriber as well. The following are the gating factors for
using log shipping as a method of creating and maintaining a redundant
database image:
- Data latency lag: the time between the transaction log dumps on the source database and when those dumps are applied to the destination databases.
- Sources and destinations must be the same SQL Server version.
- Data is read-only on the destination SQL Server until the log shipping pairing is broken (as it should be, to guarantee that the transaction logs can be applied to the destination SQL Server).
The data latency
restriction might quickly disqualify log shipping as an instantaneous
high-availability solution (if you need rapid availability of the
failover server). However, log shipping might be adequate for certain
situations. If a failure ever occurs on the primary SQL Server, a
destination SQL Server that was created and maintained via log shipping
can be swapped into use fairly quickly. The destination SQL Server would
contain exactly what was on the source SQL Server (right down to every
user ID, table, index, and file allocation map, except for any changes
to the source database that occurred
after the last log dump was applied). This directly achieves a level of
high availability. It is still not completely transparent, though,
because the SQL Server instance names are different, and the end user
may be required to log in again to the new server instance.
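The worst-case exposure can be estimated with simple arithmetic; the numbers below are hypothetical, assuming log dumps every 15 minutes and roughly 5 minutes to copy and restore each one:

```python
# Illustrative back-of-the-envelope arithmetic: with log dumps taken every
# dump_interval minutes and a copy/restore delay on the destination, the
# worst-case data latency (changes at risk in a failover) is roughly the
# dump interval plus the copy/restore delay.

def worst_case_latency_minutes(dump_interval, copy_restore_delay):
    return dump_interval + copy_restore_delay

print(worst_case_latency_minutes(15, 5))  # 20
```

If 20 minutes of potential data loss is unacceptable for your application, that single number disqualifies log shipping before any other consideration.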
Database Mirroring
Another failover option with
SQL Server is database mirroring. Database mirroring essentially
extends the old log shipping feature of SQL Server and creates an
automatic failover capability to a “hot” standby server. Database
mirroring is being billed as creating a fault-tolerant database that is
an “instant” standby (ready for use in less than three seconds).
At the heart of database mirroring is “copy-on-write” technology. Copy-on-write means that transactional changes are shipped to another server as the logs are written; all logged changes to the database instance become immediately available for copying to another location. As you can see in Figure 6, database mirroring utilizes a witness server as well as client components to insulate the client applications from any knowledge of a server failure.
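The witness’s role in deciding on automatic failover can be sketched roughly as follows; this is a simplification of the actual mirroring quorum rules, not Microsoft’s implementation:

```python
# Illustrative sketch of why a witness matters: the mirror promotes itself
# automatically only when it AND the witness agree the principal is down
# (a majority of the three servers), which guards against split-brain.

def should_fail_over(mirror_sees_principal, witness_sees_principal):
    """Return True if the mirror may become the new principal."""
    # If the witness can still see the principal, the mirror's lost
    # connection is probably a network problem, not a server failure.
    return not mirror_sees_principal and not witness_sees_principal

print(should_fail_over(False, False))  # True  -> automatic failover
print(should_fail_over(False, True))   # False -> no failover (likely a network split)
```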
Combining Failover with Scale-Out Options
SQL Server 2008 encourages combinations of options to achieve higher availability levels. A prime example is combining data replication with database mirroring to provide maximum availability of data, scalability to users, and fault tolerance via failover, potentially at each node in the replication topology. Starting with the publisher, and perhaps the distributor, you can make each of them a database mirroring failover configuration.
Building up a combination of both options is essentially the best of both worlds: the super-low latency of database mirroring for fault tolerance, plus high availability (and scalability) of data through replication.